Abstract. The exploration/exploitation (E/E) dilemma arises naturally in
many subfields of Science. Multi-armed bandit problems formalize this
dilemma in its canonical form. Most current research in this field focuses
on generic solutions that can be applied to a wide range of problems.
However, in practice, it is often the case that a form of prior information
is available about the specific class of target problems. Prior knowledge is
rarely used in current solutions due to the lack of a systematic approach to
incorporate it into the E/E strategy.
To address a specific class of E/E problems, we propose to proceed in three
steps: (i) model prior knowledge in the form of a probability distribution
over the target class of E/E problems; (ii) choose a large hypothesis space
of candidate E/E strategies; and (iii) solve an optimization problem to find
a candidate E/E strategy of maximal average performance over a sample of
problems drawn from the prior distribution.
We illustrate this meta-learning approach with two different hypothesis
spaces: one where E/E strategies are numerically parameterized and another
where E/E strategies are represented as small symbolic formulas. We propose
appropriate optimization algorithms for both cases. Our experiments, with
two-armed "Bernoulli" bandit problems and various playing budgets, show that
the meta-learnt E/E strategies outperform generic strategies of the
literature (UCB1, UCB1-Tuned, UCB-V, KL-UCB and $\epsilon_n$-Greedy); they
also evaluate the robustness of the learnt E/E strategies, by tests carried
out on arms whose rewards follow a truncated Gaussian distribution.

Keywords: exploration-exploitation dilemma, prior knowledge, multi-armed
bandit problems, reinforcement learning

1 Introduction 

Exploration versus exploitation (E/E) dilemmas arise in many subfields of
Science, and in related fields such as artificial intelligence, finance,
medicine and engineering. In its simplest version, the multi-armed bandit
problem formalizes this dilemma as follows [Tj: a gambler has T coins, and at
each step he may choose one of K slots (or arms) to allocate one of these
coins to, and then earns some money (his reward) depending on the response of
the machine he selected. Each arm's response is characterized by an unknown
probability distribution that is constant over time. The goal of the gambler
is to collect the largest cumulative reward once he has exhausted his coins
(i.e. after T plays). A rational (and risk-neutral) gambler knowing the
reward distributions of the K arms would play at every stage an arm with
maximal expected reward, so as to maximize his expected cumulative reward
(irrespective of the number K of arms, his number T of coins, and the
variances of the reward distributions). When reward distributions are
unknown, it is less trivial to decide how to play optimally since two
contradictory goals compete: exploration consists in trying an arm to acquire
knowledge about its expected reward, while exploitation consists in using the
current knowledge to decide which arm to play. How to balance the effort
between these two goals is the essence of the E/E dilemma, which is
especially difficult when a finite number T of playing opportunities is
imposed.

Most theoretical works on multi-armed bandit problems have focused on the
design of generic E/E strategies that are provably optimal in asymptotic
conditions (large T), while assuming only very unrestrictive conditions on
the reward distributions (e.g., bounded support). Among these, some
strategies work by computing at every play a quantity called "upper
confidence index" for each arm, which depends on the rewards collected so far
by this arm, and by selecting for the next play (or round of plays) the arm
with the highest index. Such E/E strategies are called index-based policies
and were initially introduced by [5], where the indices were difficult to
compute. Indices that are easier to compute were proposed later [3, 4, 5].

Index-based policies typically involve hyper-parameters whose values impact 
their relative performances. Usually, when reporting simulation results, authors 
manually tuned these values on problems that share similarities with their test 
problems (e.g., the same type of distributions for generating the rewards) by 
running trial-and-error simulations jUBj. By doing so, they actually used prior 
information on the problems to select the hyper-parameters. 

Starting from these observations, we elaborated an approach for learning, in
a reproducible way, good policies for playing multi-armed bandit problems
over finite horizons. This approach explicitly models and then exploits the
prior information on the target set of multi-armed bandit problems. We assume
that this prior knowledge is represented as a distribution over multi-armed
bandit problems, from which we can draw any number of training problems.
Given this distribution, meta-learning consists in searching a chosen set of
candidate E/E strategies for one that yields optimal expected performance.
This approach makes it possible to tune the hyper-parameters of existing
index-based policies automatically. But, more importantly, it opens the door
to searching much broader classes of E/E strategies for one that is optimal
for a given set of problems compliant with the prior information. We propose
two such hypothesis spaces composed of index-based policies: in the first
one, the index function is a linear function of features whose meta-learnt
parameters are real numbers, while in the second one it is a function
generated by a grammar of symbolic formulas.
We empirically show, in the case of Bernoulli arms, that when the number K of
arms and the playing horizon T are fully specified a priori, learning yields
policies that significantly outperform a wide range of previously proposed
generic policies (UCB1, UCB1-Tuned, UCB2, UCB-V, KL-UCB and
$\epsilon_n$-Greedy), even after careful tuning. We also evaluate the
robustness of the learned policies with respect to erroneous prior
assumptions, by testing the E/E strategies learnt for Bernoulli arms on
bandits with rewards following a truncated Gaussian distribution.

The ideas presented in this paper take their roots in two previously
published papers. The first proposed learning multi-armed bandit policies
using global optimization and numerically parameterized index-based policies;
the second [8] proposed searching for good multi-armed bandit policies in a
formula space. Compared to this previous work, we adopt here a unifying
perspective, which is the learning of E/E strategies from prior knowledge. We
also introduce an improved optimization procedure for formula search, based
on the identification of equivalence classes and on a pure exploration
multi-armed bandit formalization.

This paper is structured as follows. We first formally define the multi-armed
bandit problem and introduce index-based policies in Section 2. Section 3
formally states the E/E strategy learning problem. Section 4 and Section 5
present the numerical and symbolic instantiations of our learning approach,
respectively. Section 6 reports on experimental results. Finally, we conclude
and present future research directions in Section 7.

2 Multi-armed bandit problem and policies 

We now formally describe the (discrete) multi-armed bandit problem and the 
class of index-based policies. 

2.1 The multi-armed bandit problem 

We denote by $i \in \{1, 2, \dots, K\}$ the ($K \geq 2$) arms of the bandit
problem, by $\nu_i$ the reward distribution of arm $i$, and by $\mu_i$ its
expected value; $b_t$ is the arm played at round $t$, and
$r_t \sim \nu_{b_t}$ is the obtained reward.
$H_t = [b_1, r_1, b_2, r_2, \dots, b_t, r_t]$ is a vector that gathers the
history over the first $t$ plays, and we denote by $\mathcal{H}$ the set of
all possible histories of any length $t$. An E/E strategy (or policy)
$\pi : \mathcal{H} \to \{1, 2, \dots, K\}$ is an algorithm that processes at
play $t$ the vector $H_{t-1}$ to select the arm
$b_t \in \{1, 2, \dots, K\}$: $b_t = \pi(H_{t-1})$.

The regret of the policy $\pi$ after $T$ plays is defined by
$R_T^\pi = T\mu^* - \sum_{t=1}^{T} r_t$, where $\mu^* = \max_k \mu_k$ refers
to the expected reward of the optimal arm. The expected value of the regret
represents the expected loss due to the fact that the policy does not always
play the best machine. It can be written as:

$$\mathbb{E}\{R_T^\pi\} = \sum_{k=1}^{K} \mathbb{E}\{T_k(T)\} (\mu^* - \mu_k), \tag{1}$$

where $T_k(T)$ denotes the number of times the policy has drawn arm $k$ on
the first $T$ rounds.

The multi-armed bandit problem aims at finding a policy $\pi^*$ that, for a
given $K$, minimizes the expected regret (or, in other words, maximizes the
expected reward), ideally for any $T$ and any $\{\nu_i\}_{i=1}^{K}$.

Algorithm 1 Generic index-based discrete bandit policy

  Given scoring function $index : \mathcal{H} \times \{1, 2, \dots, K\} \to \mathbb{R}$
  for $t = 1$ to $K$ do
      Play bandit $b_t = t$        ▷ Initialization: play each bandit once
      Observe reward $r_t$
  end for
  for $t = K + 1$ to $T$ do
      Play bandit $b_t = \arg\max_{k \in \{1, 2, \dots, K\}} index(H^k_{t-1}, t)$
      Observe reward $r_t$
  end for
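The definitions of Section 2.1 can be illustrated with a minimal Python sketch. It is not part of the original work: the function names (`play`, `uniform_policy`, `pseudo_regret`) are hypothetical, arms are indexed from 0 rather than 1, and rewards are assumed to be Bernoulli. The sketch runs a policy for T plays, records the counts $T_k(T)$, and evaluates the right-hand side of Eq. (1) with realized counts in place of their expectations.

```python
import random

def play(policy, means, horizon, rng):
    """Run `policy` for `horizon` plays on Bernoulli arms with the given means.

    Returns (regret, counts): regret is T * mu_star minus the collected
    rewards, and counts[k] is T_k(T), the number of times arm k was drawn.
    """
    history = []                      # H_t stored as (arm, reward) pairs
    counts = [0] * len(means)
    total_reward = 0.0
    for _ in range(horizon):
        arm = policy(history, len(means), rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        history.append((arm, reward))
        counts[arm] += 1
        total_reward += reward
    regret = horizon * max(means) - total_reward
    return regret, counts

def uniform_policy(history, n_arms, rng):
    """Baseline that ignores the history and picks an arm uniformly at random."""
    return rng.randrange(n_arms)

def pseudo_regret(means, counts):
    """Right-hand side of Eq. (1), using realized counts instead of E{T_k(T)}."""
    mu_star = max(means)
    return sum(c * (mu_star - m) for c, m in zip(counts, means))

rng = random.Random(0)
regret, counts = play(uniform_policy, [0.3, 0.7], 1000, rng)
print(regret, counts, pseudo_regret([0.3, 0.7], counts))
```

The realized regret fluctuates around the pseudo-regret because individual rewards are random; only their expectations coincide.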

2.2 Index-based bandit policies 

Index-based bandit policies are based on a ranking index that computes for
each arm $k$ a numerical value based on the sub-history of responses
$H^k_{t-1}$ of that arm gathered at time $t$. These policies are sketched in
Algorithm 1 and work as follows. During the first $K$ plays, they play
sequentially the machines $1, 2, \dots, K$ to perform initialization. In all
subsequent plays, these policies compute for every machine $k$ the score
$index(H^k_{t-1}, t) \in \mathbb{R}$ that depends on the observed sub-history
$H^k_{t-1}$ of arm $k$ and possibly on $t$. At each step $t$, the arm with
the largest score is selected (ties are broken at random).
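The generic index-based policy described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `run_index_policy` and `greedy_index` are hypothetical, arms are indexed from 0, rewards are Bernoulli, and the index function receives only the reward sub-history of one arm and the time step.

```python
import random

def run_index_policy(index, means, horizon, rng):
    """Play a Bernoulli bandit with a generic index-based policy.

    `index(rewards_k, t)` scores an arm from its own reward sub-history and
    the current time step t. Returns the total reward and the play counts.
    """
    n_arms = len(means)
    sub_histories = [[] for _ in range(n_arms)]
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                               # initialization: each arm once
        else:
            scores = [index(h, t) for h in sub_histories]
            best = max(scores)
            ties = [k for k, s in enumerate(scores) if s == best]
            arm = rng.choice(ties)                    # ties are broken at random
        reward = 1.0 if rng.random() < means[arm] else 0.0
        sub_histories[arm].append(reward)
        total += reward
    return total, [len(h) for h in sub_histories]

def greedy_index(rewards, t):
    """Pure exploitation: the empirical mean reward of the arm."""
    return sum(rewards) / len(rewards)

total, counts = run_index_policy(greedy_index, [0.2, 0.8], 100, random.Random(1))
print(total, counts)
```

Any of the index functions discussed in this section can be passed in place of `greedy_index`, provided it only uses the sub-history of the arm and the time step.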

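Here are some examples of popular index functions:

$$index^{UCB1}(H^k_t, t) = \bar r_k + \sqrt{\frac{C \ln t}{t_k}} \tag{2}$$

$$index^{UCB1\text{-}Tuned}(H^k_t, t) = \bar r_k + \sqrt{\frac{\ln t}{t_k} \min\left(1/4,\ \bar\sigma_k^2 + \sqrt{\frac{2 \ln t}{t_k}}\right)} \tag{3}$$

$$index^{UCB1\text{-}Normal}(H^k_t, t) = \bar r_k + \sqrt{16\, \frac{t_k \bar\sigma_k^2}{t_k - 1}\, \frac{\ln(t - 1)}{t_k}} \tag{4}$$

$$index^{UCB\text{-}V}(H^k_t, t) = \bar r_k + \sqrt{\frac{2 \bar\sigma_k^2 \zeta \ln t}{t_k}} + c\, \frac{3 \zeta \ln t}{t_k} \tag{5}$$

where $\bar r_k$ and $\bar\sigma_k$ are the mean and standard deviation of the
rewards so far obtained from arm $k$ and $t_k$ is the number of times it has
been played.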

Policies UCB1, UCB1-Tuned and UCB1-Normal¹ have been proposed by [3]. UCB1
has one parameter $C > 0$ whose typical value is 2. Policy UCB-V has been
proposed by [5] and has two parameters $\zeta > 0$ and $c > 0$. We refer the
reader to [4, 5] for detailed explanations of these parameters. Note that
these index functions are the sum of an exploitation term that gives
preference to arms with a high mean reward ($\bar r_k$) and an exploration
term that aims at playing arms to gather more information on their underlying
reward distribution (which is typically an upper confidence term).

¹ Note that this index-based policy does not strictly fit inside Algorithm 1,
as it uses an additional condition to play bandits that have not been played
for a long time.
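To make the split between exploitation and exploration terms concrete, here is a hedged Python sketch of two of these index functions, UCB1 and UCB1-Tuned, written from their commonly stated definitions. The function names are hypothetical, and the use of the population variance of the observed rewards as the variance estimate is an assumption of this sketch.

```python
import math
from statistics import fmean, pvariance

def ucb1_index(rewards, t, C=2.0):
    """UCB1: empirical mean plus an exploration bonus sqrt(C ln t / t_k)."""
    t_k = len(rewards)
    return fmean(rewards) + math.sqrt(C * math.log(t) / t_k)

def ucb1_tuned_index(rewards, t):
    """UCB1-Tuned: the bonus is scaled by a variance-aware bound capped at 1/4."""
    t_k = len(rewards)
    variance = pvariance(rewards)      # assumed estimator: population variance
    bound = min(0.25, variance + math.sqrt(2.0 * math.log(t) / t_k))
    return fmean(rewards) + math.sqrt(math.log(t) / t_k * bound)

few = [1.0, 0.0]
many = [1.0, 0.0] * 50
print(ucb1_index(few, 100), ucb1_index(many, 100))
print(ucb1_tuned_index(few, 100), ucb1_tuned_index(many, 100))
```

Both arms in the usage example have the same empirical mean, so the difference between their scores comes entirely from the exploration term, which shrinks as the arm is played more often.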

3 Learning exploration/exploitation strategies 

Instead of relying on a fixed E/E strategy to solve a given class of problems, 
we propose a systematic approach to exploit prior knowledge by learning E/E 
strategies in a problem-driven way. We now state our learning approach in ab- 
stract terms. 

Prior knowledge is represented as a distribution $\mathcal{D}_P$ over bandit
problems $P = (\nu_1, \dots, \nu_K)$. From this distribution, we can sample
as many training problems as desired. In order to learn E/E strategies
exploiting this knowledge, we rely on a parametric family of candidate
strategies $\Pi_\Theta \subset \{1, 2, \dots, K\}^{\mathcal{H}}$ whose
members are policies $\pi_\theta$ that are fully defined given parameters
$\theta \in \Theta$. Given $\Pi_\Theta$, the learning problem aims at solving:

$$\theta^* = \arg\min_{\theta \in \Theta} \mathbb{E}_{P \sim \mathcal{D}_P}\{\mathbb{E}\{R^{\pi_\theta}_{P,T}\}\}, \tag{6}$$

where $\mathbb{E}\{R^{\pi}_{P,T}\}$ is the expected cumulative regret of
$\pi$ on problem $P$ and where $T$ is the (a priori given) playing horizon.
Solving this minimization problem is non-trivial since it involves an
expectation over an infinite number of problems. Furthermore, given a problem
$P$, computing $\mathbb{E}\{R^{\pi}_{P,T}\}$ relies on the expected values of
$T_k(T)$, which we cannot compute exactly in the general case. Therefore, we
propose to approximate the expected cumulative regret by the empirical mean
regret over a finite set of training problems $P^{(1)}, \dots, P^{(N)}$
drawn from $\mathcal{D}_P$:

$$\theta^* = \arg\min_{\theta \in \Theta} \Delta(\pi_\theta) \quad \text{where} \quad \Delta(\pi) = \frac{1}{N} \sum_{i=1}^{N} R^{\pi}_{P^{(i)},T}, \tag{7}$$

and where the $R^{\pi_\theta}_{P^{(i)},T}$ values are estimated by performing
a single trajectory of $\pi_\theta$ on problem $P^{(i)}$. Note that the number
of training problems $N$ will typically be large in order to make the
variance of $\Delta(\cdot)$ reasonably small.

In order to instantiate this approach, two components have to be provided:
the hypothesis space $\Pi_\Theta$ and the optimization algorithm to solve
Eq. (7). The next two sections describe different instantiations of these
components.
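The empirical objective of Eq. (7) can be sketched in a few lines of Python. The sketch below is illustrative only: the names `sample_problem`, `episode_regret`, `empirical_mean_regret` and `always_first` are hypothetical, the prior is assumed to be a distribution over Bernoulli problems with uniformly drawn means, and a policy is assumed to be a function of per-arm play counts, per-arm reward sums and the time step.

```python
import random

def sample_problem(rng, n_arms=2):
    """Draw one Bernoulli bandit problem: arm means uniform in [0, 1)."""
    return [rng.random() for _ in range(n_arms)]

def episode_regret(policy, means, horizon, rng):
    """Regret T * mu_star - sum_t r_t observed on a single trajectory."""
    counts = [0] * len(means)          # plays per arm
    sums = [0.0] * len(means)          # cumulated reward per arm
    for t in range(1, horizon + 1):
        arm = policy(counts, sums, t)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return horizon * max(means) - sum(sums)

def empirical_mean_regret(policy, problems, horizon, rng):
    """Delta(pi): average single-trajectory regret over the training problems."""
    regrets = [episode_regret(policy, p, horizon, rng) for p in problems]
    return sum(regrets) / len(problems)

def always_first(counts, sums, t):
    """Degenerate policy used only as a sanity check: always play arm 0."""
    return 0

rng = random.Random(0)
training_problems = [sample_problem(rng) for _ in range(100)]
delta = empirical_mean_regret(always_first, training_problems, 100, rng)
print(delta)
```

An optimizer over a policy family would call `empirical_mean_regret` once per candidate policy, which is why the cost of each evaluation matters in the following sections.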

4 Numeric parameterization 

We now instantiate our meta-learning approach by considering E/E strategies 
that have numerical parameters. 
4.1 Policy search space

To define the parametric family of candidate policies $\Pi_\Theta$, we use
index functions expressed as linear combinations of history features. These
index functions rely on a history feature function
$\phi : \mathcal{H} \times \{1, 2, \dots, K\} \to \mathbb{R}^d$ that
describes the history w.r.t. a given arm as a vector of scalar features.
Given the function $\phi(\cdot, \cdot)$, index functions are defined by

$$index_\theta(H_t, k) = \langle \theta, \phi(H_t, k) \rangle,$$

where $\theta \in \mathbb{R}^d$ are parameters and
$\langle \cdot, \cdot \rangle$ is the classical dot product operator. The set
of candidate policies $\Pi_\Theta$ is composed of all index-based policies
obtained with such index functions given parameters
$\theta \in \mathbb{R}^d$.

History features may describe any aspect of the history, including empirical
reward moments, current time step, arm play counts or combinations of these
variables. The set of such features should not be too large to avoid
parameter estimation difficulties, but it should be large enough to provide
the support for a rich set of E/E strategies. We here propose one possibility
for defining the history feature function, which can be applied to any
multi-armed problem and which is shown to perform well in Section 6.

To compute $\phi(H_t, k)$, we first compute the following four variables:
$v_1 = \sqrt{\ln t}$, $v_2 = 1/\sqrt{t_k}$, $v_3 = \bar r_k$ and
$v_4 = \bar\sigma_k$, i.e. the square root of the logarithm of the current
time step, the inverse square root of the number of times arm $k$ has been
played, and the empirical mean and standard deviation of the rewards obtained
so far by arm $k$.

Then, these variables are multiplied in different ways to produce features.
The number of these combinations is controlled by a parameter $P$ whose
default value is 1. Given $P$, there is one feature $f_{i,j,k,l}$ per
possible combination of values of $i, j, k, l \in \{0, \dots, P\}$, which is
defined as follows: $f_{i,j,k,l} = v_1^i v_2^j v_3^k v_4^l$.

In other terms, there is one feature per monomial in the variables
$v_1, \dots, v_4$ in which each exponent is at most $P$, which gives
$(P + 1)^4$ features. In the following, we denote Power-1 (resp., Power-2)
the policy learned using function $\phi(H_t, k)$ with parameter $P = 1$
(resp., $P = 2$). The index function that underlies these policies can be
written as follows:

$$index^{Power\text{-}P}(H_t, k) = \sum_{i=0}^{P} \sum_{j=0}^{P} \sum_{k=0}^{P} \sum_{l=0}^{P} \theta_{i,j,k,l}\, v_1^i v_2^j v_3^k v_4^l \tag{8}$$

where $\theta_{i,j,k,l}$ are the learned parameters. The Power-1 policy has 16
such parameters and the Power-2 policy has 81 parameters.
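The feature construction can be checked with a short Python sketch. It is an illustration under assumptions, not the authors' code: `power_features` and `power_index` are hypothetical names, the sub-history of an arm is passed as a list of rewards, and the standard deviation is taken as the population standard deviation.

```python
import math
from itertools import product

def power_features(rewards, t, P=1):
    """Features v1^i * v2^j * v3^k * v4^l for all exponents in {0, ..., P}."""
    t_k = len(rewards)
    mean = sum(rewards) / t_k
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / t_k)
    v = (math.sqrt(math.log(t)), 1.0 / math.sqrt(t_k), mean, std)
    return [v[0] ** i * v[1] ** j * v[2] ** k * v[3] ** l
            for i, j, k, l in product(range(P + 1), repeat=4)]

def power_index(theta, rewards, t, P=1):
    """Linear index: dot product between theta and the feature vector."""
    return sum(w * f for w, f in zip(theta, power_features(rewards, t, P)))

print(len(power_features([1.0, 0.0], 3, P=1)))   # number of Power-1 features
print(len(power_features([1.0, 0.0], 3, P=2)))   # number of Power-2 features

# A parameter vector that selects only the feature with exponents (0, 0, 1, 0),
# i.e. the empirical mean, recovers the greedy index.
theta = [0.0] * 16
theta[2] = 1.0
print(power_index(theta, [1.0, 0.0], 3))
```

The feature counts follow from $(P + 1)^4$, which matches the 16 and 81 parameters stated above.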

4.2 Optimisation algorithm 

We now discuss the optimization of Equation (7) in the case of our numerical
parameterization. Note that the objective function we want to optimize, in
addition to being stochastic, has a complex relation with the parameters
$\theta$. A slight change in the parameter vector $\theta$ may lead to
significantly different bandit episodes and expected regret values. Local
optimization approaches may thus not be appropriate here. Instead, we suggest
the use of derivative-free global optimization algorithms.

Algorithm 2 EDA-based learning of a discrete bandit policy

  Given the number of iterations $i_{max}$,
  Given the population size $n_p$,
  Given the number of best elements $b$,
  Given a sample of training bandit problems $P^{(1)}, \dots, P^{(N)}$,
  Given a history-features function $\phi(\cdot, \cdot) \in \mathbb{R}^d$,

  Set $\mu_p = 0$, $\sigma_p = 1$, $\forall p \in [1, d]$        ▷ Initialize with normal Gaussians
  for $i \in [1, i_{max}]$ do
      for $j \in [1, n_p]$ do        ▷ Sample and evaluate new population
          for $p \in [1, d]$ do
              $\theta_p \leftarrow$ sample from $\mathcal{N}(\mu_p, \sigma_p^2)$
          end for
          Estimate $\Delta(\pi_\theta)$ and store result $(\theta, \Delta(\pi_\theta))$
      end for
      Select $\{\theta^{(1)}, \dots, \theta^{(b)}\}$, the $b$ best candidate $\theta$ vectors w.r.t. their $\Delta(\cdot)$ score
      $\mu_p \leftarrow \frac{1}{b} \sum_{j=1}^{b} \theta_p^{(j)}$, $\forall p \in [1, d]$        ▷ Learn new Gaussians
      $\sigma_p^2 \leftarrow \frac{1}{b} \sum_{j=1}^{b} (\theta_p^{(j)} - \mu_p)^2$, $\forall p \in [1, d]$
  end for
  return The policy $\pi_\theta$ that led to the lowest observed value of $\Delta(\pi_\theta)$

In this work, we use a powerful, yet simple, class of global optimization
algorithms known as cross-entropy methods, also known as Estimation of
Distribution Algorithms (EDA) [9]. EDAs rely on a probabilistic model to
describe promising regions of the search space and to sample good candidate
solutions. This is performed by repeating iterations that first sample a
population of $n_p$ candidates using the current probabilistic model and then
fit a new probabilistic model given the $b < n_p$ best candidates.

Any kind of probabilistic model may be used inside an EDA. The simplest form
of EDA uses one marginal distribution per variable to optimize and is known
as the univariate marginal distribution algorithm [10]. We have adopted this
approach by using one Gaussian distribution $\mathcal{N}(\mu_p, \sigma_p^2)$
for each parameter $\theta_p$. Although this approach is simple, it proved to
be quite effective experimentally for solving Equation (7). The full details
of our EDA-based policy learning procedure are given by Algorithm 2. The
initial distributions are standard Gaussian distributions
$\mathcal{N}(0, 1)$. The policy that is returned corresponds to the
parameters $\theta$ that led to the lowest observed value of
$\Delta(\pi_\theta)$.
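The univariate Gaussian EDA loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `gaussian_eda` and its default settings are hypothetical, and a simple deterministic quadratic objective stands in for the stochastic regret estimate $\Delta(\pi_\theta)$.

```python
import math
import random

def gaussian_eda(objective, dim, n_iter=50, pop_size=40, n_best=10, seed=0):
    """Minimize `objective` with one independent Gaussian per parameter."""
    rng = random.Random(seed)
    mu = [0.0] * dim                   # start from standard Gaussians N(0, 1)
    sigma = [1.0] * dim
    best_score, best_theta = math.inf, None
    for _ in range(n_iter):
        population = []
        for _ in range(pop_size):
            theta = [rng.gauss(mu[p], sigma[p]) for p in range(dim)]
            score = objective(theta)
            population.append((score, theta))
            if score < best_score:     # remember the best candidate ever seen
                best_score, best_theta = score, theta
        population.sort(key=lambda pair: pair[0])
        elite = [theta for _, theta in population[:n_best]]
        for p in range(dim):           # refit each Gaussian on the elite set
            values = [theta[p] for theta in elite]
            mu[p] = sum(values) / n_best
            sigma[p] = math.sqrt(sum((x - mu[p]) ** 2 for x in values) / n_best)
    return best_theta, best_score

target = [0.5, -0.5]

def toy_objective(theta):
    """Stand-in objective: squared distance to a fixed target point."""
    return sum((x - c) ** 2 for x, c in zip(theta, target))

theta, score = gaussian_eda(toy_objective, dim=2)
print(theta, score)
```

With a noisy objective such as an empirical regret, the returned candidate is the one with the lowest observed score, which is not necessarily the one with the lowest expected score.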

5 Symbolic parametrization 

The index functions from the literature depend on the current time step $t$
and on three statistics extracted from the sub-history $H^k_{t-1}$:
$\bar r_k$, $\bar\sigma_k$ and $t_k$. We now propose a second
parameterization of our learning approach, in which we consider all index
functions that can be constructed using small formulas built upon these four
variables.

5.1 Policy search space 

We consider index functions that are given in the form of small, closed-form
formulas. Closed-form formulas have several advantages: they can be easily
computed, they can be formally analyzed and they are easily interpretable.

Let us first make explicit the set of formulas $\mathbb{F}$ that we consider
in this paper. A formula $F \in \mathbb{F}$ is:

- either a binary expression $F = B(F', F'')$, where $B$ belongs to a set of
  binary operators $\mathbb{B}$ and $F'$ and $F''$ are also formulas from
  $\mathbb{F}$,

- or a unary expression $F = U(F')$, where $U$ belongs to a set of unary
  operators $\mathbb{U}$ and $F' \in \mathbb{F}$,

- or an atomic variable $F = V$, where $V$ belongs to a set of variables
  $\mathbb{V}$,

- or a constant $F = C$, where $C$ belongs to a set of constants $\mathbb{C}$.

In the following, we consider a set of operators and constants that provides
a good compromise between high expressiveness and low cardinality of
$\mathbb{F}$. The set of binary operators $\mathbb{B}$ considered in this
paper includes the four elementary mathematical operations and the min and
max operators: $\mathbb{B} = \{+, -, \times, \div, \min, \max\}$. The set of
unary operators $\mathbb{U}$ contains the square root, the logarithm, the
absolute value, the opposite and the inverse:
$\mathbb{U} = \{\sqrt{\cdot}, \ln(\cdot), |\cdot|, -\cdot, 1/\cdot\}$. The set
of variables $\mathbb{V}$ is:
$\mathbb{V} = \{\bar r_k, \bar\sigma_k, t_k, t\}$. The set of constants
$\mathbb{C}$ has been chosen to maximize the number of different numbers
representable by small formulas. It is defined as
$\mathbb{C} = \{1, 2, 3, 5, 7\}$.

Figure 1 summarizes our grammar of formulas and gives two examples of index
functions. The length of a formula, $length(f)$, is the number of symbols
occurring in the formula. For example, the length of $\bar r_k + 2/t_k$ is 5
and the length of $\bar r_k + \sqrt{2 \times \ln(t)/t_k}$ is 9. Let $L$ be a
given maximal length. $\Theta_L$ is the subset of formulas whose length is no
more than $L$: $\Theta_L = \{f \mid length(f) \leq L\}$, and $\Pi_\Theta$ is
the set of index-based policies whose index functions are defined by formulas
$f \in \Theta_L$.
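The notion of formula length can be made concrete with a small Python sketch that stores formulas as nested tuples. The representation, the operator spellings and the names `formula_length` and `evaluate` are assumptions made for this illustration only.

```python
import math

BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
    "min": min,
    "max": max,
}
UNARY = {
    "sqrt": math.sqrt,
    "ln": math.log,
    "abs": abs,
    "opposite": lambda a: -a,
    "inverse": lambda a: 1.0 / a,
}

def formula_length(formula):
    """Number of symbols (operators, variables, constants) in a formula tree."""
    if isinstance(formula, tuple):
        _, *operands = formula
        return 1 + sum(formula_length(f) for f in operands)
    return 1

def evaluate(formula, env):
    """Evaluate a formula tree given values for the variables r, s, t_k and t."""
    if isinstance(formula, tuple):
        operator, *operands = formula
        values = [evaluate(f, env) for f in operands]
        table = BINARY if len(values) == 2 else UNARY
        return table[operator](*values)
    if isinstance(formula, str):
        return env[formula]            # atomic variable
    return float(formula)              # constant

# r + 2 / t_k and r + sqrt(2 * ln(t) / t_k), the two examples of Figure 1.
f1 = ("+", "r", ("/", 2, "t_k"))
f2 = ("+", "r", ("sqrt", ("/", ("*", 2, ("ln", "t")), "t_k")))
print(formula_length(f1), formula_length(f2))
print(evaluate(f1, {"r": 0.5, "s": 0.1, "t_k": 4, "t": 10}))
```

Counting one symbol per tree node reproduces the lengths 5 and 9 quoted in the text for the two example formulas.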

5.2 Optimisation algorithm 

We now discuss the optimization of Equation (7) in the case of our symbolic
parameterization. First, notice that several different formulas can lead to
the same policy. For example, any increasing function of $\bar r_k$ defines
the greedy policy, which always selects the arm that is believed to be the
best. Examples of such functions in our formula search space include
$\bar r_k$, $\bar r_k \times 2$ and $\sqrt{\bar r_k}$.

Since it is useless to evaluate equivalent policies multiple times, we
propose the following two-step approach. First, the set $\Theta_L$ is
partitioned into equivalence classes, two formulas being equivalent if and
only if they lead to the same policy. Then, Equation (7) is solved over the
set of equivalence classes (which is typically one or two orders of magnitude
smaller than the initial set $\Theta_L$).

F ::= B(F, F) | U(F) | V | C
B ::= + | - | × | ÷ | min | max
U ::= sqrt | ln | abs | opposite | inverse
V ::= $\bar r_k$ | $\bar\sigma_k$ | $t_k$ | $t$
C ::= 1, 2, 3, 5, 7

Fig. 1. The grammar used for generating candidate index functions and two
example formula parse trees corresponding to $\bar r_k + 2/t_k$ and
$\bar r_k + \sqrt{2 \ln(t)/t_k}$.

Partitioning $\Theta_L$. This task is far from trivial: given a formula,
equivalent formulas can be obtained through commutativity, associativity,
operator-specific rules and through any increasing transformation. Performing
this step exactly involves advanced static analysis of the formulas, which we
believe to be very difficult to implement. Instead, we propose a simple
approximate solution, which consists in discriminating formulas by comparing
how they rank (in terms of values returned by the formula) a set of $d$
random samples of the variables $\bar r_k, \bar\sigma_k, t_k, t$. More
formally, the procedure is the following:

1. we first build $\Theta_L$, the space of all formulas $f$ such that
   $length(f) \leq L$;

2. for $i = 1 \dots d$, we uniformly draw (within their respective domains)
   some random realizations of the variables
   $\bar r_k, \bar\sigma_k, t_k, t$ that we concatenate into a vector
   $\Theta_i$;

3. we cluster all formulas from $\Theta_L$ according to the following rule:
   two formulas $F$ and $F'$ belong to the same cluster if and only if they
   rank all the $\Theta_i$ points in the same order, i.e.:
   $\forall i, j \in \{1, \dots, d\}, i \neq j, F(\Theta_i) \geq F(\Theta_j) \iff F'(\Theta_i) \geq F'(\Theta_j)$.
   Formulas leading to invalid index functions (caused for instance by
   division by zero or logarithm of negative values) are discarded;

4. within each cluster, we select one formula of minimal length;

5. we gather all the selected minimal-length formulas into an approximate
   reduced set of formulas $\bar\Theta$.

In the following, we denote by $M$ the cardinality of the approximate set of
formulas $\bar\Theta = \{f_1, \dots, f_M\}$.
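The five steps above can be sketched in Python on a handful of hand-written candidates. This is an illustrative sketch rather than the authors' procedure: the names `rank_signature` and `reduce_formulas`, the sampling domains of the four variables and the tiny candidate list are all assumptions, and formulas are given directly as Python functions with their length attached.

```python
import math
import random

def rank_signature(func, points):
    """Dense ranks of func over the sample points, or None if func is invalid."""
    try:
        values = [func(*p) for p in points]
    except (ValueError, ZeroDivisionError, OverflowError):
        return None
    if not all(math.isfinite(v) for v in values):
        return None
    rank = {v: i for i, v in enumerate(sorted(set(values)))}
    return tuple(rank[v] for v in values)

def reduce_formulas(candidates, n_samples=64, seed=0):
    """Keep one minimal-length formula per cluster of identical rankings."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_samples):
        t_k = rng.randint(1, 100)              # assumed domain for t_k
        t = rng.randint(t_k, 200)              # assumed domain: t >= t_k
        points.append((rng.random(), rng.uniform(0.0, 0.5), t_k, t))
    clusters = {}
    for name, length, func in candidates:
        signature = rank_signature(func, points)
        if signature is None:
            continue                           # invalid index function
        if signature not in clusters or length < clusters[signature][1]:
            clusters[signature] = (name, length)
    return sorted(name for name, _ in clusters.values())

# Arguments are (mean reward r, standard deviation s, play count t_k, time t).
candidates = [
    ("r*2", 3, lambda r, s, t_k, t: r * 2),
    ("r", 1, lambda r, s, t_k, t: r),
    ("sqrt(r)", 2, lambda r, s, t_k, t: math.sqrt(r)),
    ("r+1/t_k", 5, lambda r, s, t_k, t: r + 1 / t_k),
    ("ln(-t)", 3, lambda r, s, t_k, t: math.log(-t)),
]
reduced = reduce_formulas(candidates)
print(reduced)
```

Two formulas receive the same dense-rank signature exactly when they order every pair of sample points in the same way, which is the clustering rule of step 3.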

Optimization algorithm. A naive approach for finding the best formula
$f^* \in \bar\Theta$ would be to evaluate $\Delta(f)$ for each formula
$f \in \bar\Theta$ and simply return the best one. While extremely simple to
implement, such an approach could prove time-inefficient for spaces of large
cardinality.

Preliminary experiments have shown us that $\bar\Theta$ contains a majority
of formulas leading to relatively badly performing index-based policies. It
turns out that relatively few samples of $R^{\pi}_{P^{(i)},T}$ are sufficient
to reject these badly performing formulas with high confidence. In order to
exploit this, a natural idea is to formalize the search for the best formula
as another multi-armed bandit problem. To each formula $f_k \in \bar\Theta$,
we associate an arm. Pulling the arm $k$ consists in selecting a training
problem $P^{(i)}$ and in running one episode with the index-based policy
whose index formula is $f_k$. This leads to a reward associated to arm $k$
whose value is the quantity $-R^{\pi}_{P^{(i)},T}$ observed during the
episode. The purpose of multi-armed bandit algorithms is here to process the
sequence of observed rewards to select in a smart way the next formula to be
tried so that when the budget of pulls has been exhausted, one (or several)
high-quality formula(s) can be identified.

In the formalization of Equation (7) as a multi-armed bandit problem, only
the quality of the finally suggested arm matters. How to select arms so as to
identify the best one in a finite amount of time is known as the pure
exploration multi-armed bandit problem [11]. It has been shown that
index-based policies based on upper confidence bounds are good policies for
solving pure exploration bandit problems. Our optimization procedure works as
follows: we use a bandit algorithm such as UCB1-Tuned for a given number of
steps and then return the policy that corresponds to the formula $f_k$ with
the highest expected reward $\bar r_k$. The problem instances are selected
depending on the number of times the arm has been played so far: at each
step, we select the training problem $P^{(i)}$ with
$i = 1 + (t_k \bmod N)$.
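The formula-selection procedure can be sketched in Python as a bandit over candidate index functions. Several choices below are assumptions of this sketch and not taken from the source: the names `episode_regret` and `select_formula`, the three toy candidates, the fixed list of training problems, the rescaling of the reward to $-R/T$, and the deterministic tie-breaking inside an episode.

```python
import math
import random

def episode_regret(index, means, horizon, rng):
    """Regret of one episode of an index-based policy on Bernoulli arms."""
    n = len(means)
    counts, sums = [0] * n, [0.0] * n
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1                                    # play each arm once
        else:                                              # first maximum wins ties
            arm = max(range(n), key=lambda k: index(sums[k] / counts[k], counts[k], t))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return horizon * max(means) - sum(sums)

def select_formula(formulas, problems, horizon, budget, rng):
    """UCB1-Tuned over formulas; returns the arm with the best mean reward."""
    n = len(formulas)
    counts, sums, squares = [0] * n, [0.0] * n, [0.0] * n

    def score(k, step):
        mean = sums[k] / counts[k]
        var = max(squares[k] / counts[k] - mean * mean, 0.0)
        bound = min(0.25, var + math.sqrt(2.0 * math.log(step) / counts[k]))
        return mean + math.sqrt(math.log(step) / counts[k] * bound)

    for step in range(1, budget + 1):
        k = step - 1 if step <= n else max(range(n), key=lambda a: score(a, step))
        problem = problems[counts[k] % len(problems)]      # i = 1 + (t_k mod N), 0-indexed
        reward = -episode_regret(formulas[k], problem, horizon, rng) / horizon
        counts[k] += 1
        sums[k] += reward
        squares[k] += reward * reward
    best = max(range(n), key=lambda a: sums[a] / counts[a])
    return best, counts

formulas = [
    lambda r, t_k, t: -r,                                  # deliberately bad candidate
    lambda r, t_k, t: r,                                   # greedy
    lambda r, t_k, t: r + math.sqrt(2.0 * math.log(t) / t_k),
]
problems = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
best, pulls = select_formula(formulas, problems, 50, 300, random.Random(0))
print(best, pulls)
```

Because every formula cycles through the training problems in the same order, candidates that have been pulled equally often have been evaluated on the same problems.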

In our experiments, we estimate that our multi-armed bandit approach is one
hundred to one thousand times faster than the naive Monte Carlo optimization
procedure, which clearly demonstrates the benefits of this approach. Note
that this idea could also be relevant to our numerical case. The main
difference is that the corresponding multi-armed bandit problem relies on a
continuous-arm space. Although some algorithms have already been proposed to
solve such multi-armed bandit problems [H], how to scale these techniques to
problems with hundreds or thousands of parameters is still an open research
question. Progress in this field could directly benefit our numerical
learning approach.

6 Numerical experiments 

We now illustrate the two instances of our learning approach by comparing
learned policies against a number of generic previously proposed policies in
a setting where prior knowledge is available about the target problems. We
show that in both cases, learning yields exploration/exploitation strategies
that significantly outperform all tested generic policies.

6.1 Experimental protocol 

We compare learned policies against generic policies. We distinguish between
untuned generic policies and tuned generic policies. The former are either
policies that are parameter-free or policies used with default parameters
suggested in the literature, while the latter are generic policies whose
hyper-parameters were tuned using Algorithm 2.

Training and testing. To illustrate our approach, we consider the scenario
where the number of arms $K$, the playing horizon $T$ and the kind of
distributions $\nu_k$ are known a priori and where the parameters of these
distributions are missing information. Since we are learning policies, care
should be taken with generalization issues. As usual in supervised machine
learning, we use a training set which is distinct from the testing set. The
training set is composed of $N = 100$ bandit problems sampled from a given
distribution over bandit problems $\mathcal{D}_P$, whereas the testing set
contains another 10000 problems drawn from this distribution. To study the
robustness of our policies w.r.t. wrong prior information, we also report
their performance on a set of 10000 problems drawn from another distribution
$\mathcal{D}'_P$ with different kinds of distributions $\nu_k$. When
computing $\Delta(\pi_\theta)$, we estimate the regret for each of these
problems by averaging results over 100 runs. One calculation of
$\Delta(\pi_\theta)$ thus involves simulating $10^4$ (resp. $10^6$) bandit
episodes during training (resp. testing).

Problem distributions. The distribution $\mathcal{D}_P$ is composed of
two-armed bandit problems with Bernoulli distributions whose expectations are
uniformly drawn from [0, 1]. Hence, in order to sample a bandit problem from
$\mathcal{D}_P$, we draw the expectations $p_1$ and $p_2$ uniformly from
[0, 1] and return the bandit problem with two Bernoulli arms that have
expectations $p_1$ and $p_2$, respectively. In the second distribution
$\mathcal{D}'_P$, the reward distributions $\nu_k$ are replaced by Gaussian
distributions truncated to the interval [0, 1]. In order to sample one
problem from $\mathcal{D}'_P$, we select a mean and a standard deviation for
each arm uniformly in the range [0, 1]. Rewards are then sampled using a
rejection sampling approach: samples are drawn from the corresponding
Gaussian distribution until obtaining a value that belongs to the interval
[0, 1].
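The two problem distributions and the rejection sampling of rewards can be sketched in Python as follows. The function names are hypothetical and the sketch only mirrors the description above; it is not the authors' experimental code.

```python
import random

def sample_bernoulli_problem(rng, n_arms=2):
    """One problem from the Bernoulli prior: arm expectations uniform in [0, 1)."""
    return [rng.random() for _ in range(n_arms)]

def sample_gaussian_problem(rng, n_arms=2):
    """One problem from the Gaussian prior: a (mean, std) pair per arm."""
    return [(rng.random(), rng.random()) for _ in range(n_arms)]

def bernoulli_reward(p, rng):
    """Reward 1 with probability p, else 0."""
    return 1.0 if rng.random() < p else 0.0

def truncated_gaussian_reward(mean, std, rng):
    """Rejection sampling: redraw until the Gaussian sample falls in [0, 1]."""
    while True:
        x = rng.gauss(mean, std)
        if 0.0 <= x <= 1.0:
            return x

rng = random.Random(0)
bernoulli_problem = sample_bernoulli_problem(rng)
gaussian_problem = sample_gaussian_problem(rng)
mean, std = gaussian_problem[0]
rewards = [truncated_gaussian_reward(mean, std, rng) for _ in range(1000)]
print(bernoulli_problem, gaussian_problem, min(rewards), max(rewards))
```

Note that truncation shifts the expected reward of an arm away from the mean of the underlying Gaussian, so the best arm of a truncated problem is not necessarily the one with the largest sampled mean parameter.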

Generic policies. We consider the following generic policies: the
$\epsilon_n$-Greedy policy as described in [1], the policies introduced by
[1]: UCB1, UCB1-Tuned, UCB1-Normal and UCB2, the policy KL-UCB introduced in
[13] and the policy UCB-V proposed by [5]. Except $\epsilon_n$-Greedy, all
these policies belong to the family of index-based policies discussed
previously. UCB1-Tuned and UCB1-Normal are parameter-free policies designed
for bandit problems with Bernoulli distributions and for problems with
Gaussian distributions respectively. All the other policies have
hyper-parameters that can be tuned to improve the quality of the policy.
$\epsilon_n$-Greedy has two parameters $c > 0$ and $0 < d < 1$, UCB2 has one
parameter $0 < \alpha < 1$, KL-UCB has one parameter $c \geq 0$ and UCB-V has
two parameters $\zeta > 0$ and $c > 0$. We refer the reader to [4, 5, 13] for
detailed explanations of these parameters.

Learning numerical policies. We learn policies using the two
parameterizations Power-1 and Power-2 described in Section 4.1. Note that
tuning generic policies is a particular case of learning with numerical
parameters and that both learned policies and tuned generic policies make use
of the same prior knowledge. To make our comparison between these two kinds
of policies fair, we always use the same training procedure, which is
Algorithm 2 with $i_{max} = 100$ iterations, $n_p = \max(8d, 40)$ candidate
policies per iteration and $b = n_p/4$ best elements, where $d$ is the number
of parameters to optimize. Having a linear dependency between $n_p$ and $d$
is a classical choice when using EDAs [T3]. Note that, in most cases, the
optimization is solved in a few iterations or a few tens of iterations. Our
simulations have shown that $i_{max} = 100$ is a careful choice for ensuring
that the optimization has enough time to properly converge. For the baseline
policies where some default values are advocated, we use these values as the
initial expectations of the EDA Gaussians. Otherwise, the initial Gaussians
are centered on zero. Nothing is done to force the EDA to respect the
constraints on the parameters (e.g., $c > 0$ and $0 < d < 1$ for
$\epsilon_n$-Greedy). In practice, the EDA automatically identifies
interesting regions of the search space that respect these constraints.

Learning symbolic policies. We apply our symbolic learning approach with a
maximal formula length of $L = 7$, which leads to a set of
$|\Theta_L| \approx 33.5$ million formulas. We have applied the approximate
partitioning approach described in Section 5.2 to these formulas, using
$d = 1024$ samples to discriminate among strategies. This has resulted in
$\approx 9.5$ million invalid formulas and $M = 99020$ distinct candidate E/E
strategies (i.e. distinct formula equivalence classes). To identify the best
of those distinct strategies, we apply the UCB1-Tuned algorithm for 10^
steps. In our experiments, we report the two best found policies, which we
denote Formula-1 and Formula-2.

6.2 Performance comparison 

Table 1 reports the results we obtain for untuned generic policies, tuned
generic policies and learned policies on distributions $\mathcal{D}_P$ and
$\mathcal{D}'_P$ with horizons $T \in \{10, 100, 1000\}$. For both tuned and
learned policies, we consider three different training horizons
{10, 100, 1000} to see the effect of a mismatch between the training and the
testing horizon.

Generic policies. As already pointed out in [1], it can be seen that UCB1-Tuned
is particularly well suited to bandit problems with Bernoulli distributions.
It also proves effective on bandit problems with Gaussian distributions, making
it nearly always outperform the other untuned policies. By tuning UCB1, we
outperform the UCB1-Tuned policy (e.g. 4.91 instead of 5.43 on Bernoulli
problems with T = 1000). This also sometimes happens with UCB-V. However,
though we used a careful tuning procedure, UCB2 and ε_n-GREEDY never
outperform UCB1-Tuned.
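
To make the comparison concrete, the following sketch estimates the mean
regret of a UCB1-style index policy with exploration constant C on two-armed
Bernoulli problems. It is a hedged illustration, not our experimental code: it
assumes the usual UCB1 index (empirical mean plus sqrt(C ln t / t_k)), arm
expectations drawn uniformly in [0, 1], one initial play of each arm, and
regret measured against the true arm expectations. The function names and the
number of sampled problems are hypothetical.

```python
import math
import random


def ucb1_index(mean, plays, t, c=2.0):
    """UCB1-style index: empirical mean plus an exploration bonus."""
    return mean + math.sqrt(c * math.log(t) / plays)


def expected_regret(index, probs, horizon, rng):
    """Play one Bernoulli bandit episode and return the accumulated regret."""
    k = len(probs)
    plays = [0] * k
    sums = [0.0] * k
    best = max(probs)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once before using the index
        else:
            arm = max(range(k),
                      key=lambda a: index(sums[a] / plays[a], plays[a], t))
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        plays[arm] += 1
        sums[arm] += reward
        regret += best - probs[arm]  # loss in expectation for this play
    return regret


def mean_regret(index, horizon, n_problems=200, seed=0):
    """Average regret over problems sampled from an assumed uniform prior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_problems):
        probs = [rng.random(), rng.random()]
        total += expected_regret(index, probs, horizon, rng)
    return total / n_problems
```

A tuned variant is obtained by fixing the constant, for instance
mean_regret(lambda m, n, t: ucb1_index(m, n, t, c=0.17), horizon=100), which
can then be compared with the default c = 2.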

Learned policies. We observe that when the training horizon is the same as
the testing horizon T, the learned policies (Power-1, Power-2, Formula-1
and Formula-2) systematically outperform all generic policies. The overall best
results are obtained with Power-2 policies. Note that, due to their numerical
nature and due to the large number of parameters, these policies are extremely
hard to interpret and to understand. The results related to symbolic policies
show that there exist very simple policies that perform nearly as well as these
black-box policies. This clearly shows the benefits of our two hypothesis spaces:
numerical policies make it possible to reach very high performance, while
symbolic policies provide interpretable strategies whose behavior can be more
easily analyzed. This

Policy       Training  Parameters                 Bernoulli               Gaussian
             horizon                              T=10    T=100   T=1000  T=10    T=100   T=1000

Untuned generic policies
UCB1                   C = 2                      1.07    5.57    20.1    1.37    10.6    66.7
UCB1-TUNED                                        0.75    2.28    5.43    1.09    6.62    37.0
UCB1-NORMAL                                       1.71    13.1    31.7    1.65    13.4    58.8
UCB2                   α = 10^                    0.97    3.13    7.26    1.28    7.90    40.1
UCB-V                  c = 1, C = 1               1.45    8.59    25.5    1.55    12.3    63.4
KL-UCB                 c = 0                      0.76    2.47    6.61    1.14    7.66    43.8
KL-UCB                 c = 3                      0.82    3.29    9.81    1.21    8.90    53.0
ε_n-GREEDY             c = 1, d = 1               1.07    3.21    11.5    1.20    6.24    41.4

Tuned generic policies
UCB1         T=10      C = 0.170                  0.74    2.05    4.85    1.05    6.05    32.1
             T=100     C = 0.173                  0.74    2.05    4.84    1.05    6.06    32.3
             T=1000    C = 0.187                  0.74    2.08    4.91    1.05    6.17    33.0
UCB2         T=10      α = 0.0316                 0.97    3.15    7.39    1.28    7.91    40.5
             T=100     α = 0.000749               0.97    3.12    7.26    1.33    8.14    40.4
             T=1000    α = 0.00398                0.97    3.13    7.25    1.28    7.89    40.0
UCB-V        T=10      c = 1.542, C = 0.0631      0.75    2.36    5.15    1.01    5.75    26.8
             T=100     c = 1.681, C = 0.0347      0.75    2.28    7.07    1.01    5.30    27.4
             T=1000    c = 1.304, C = 0.0852      0.77    2.43    5.14    1.13    5.99    27.5
KL-UCB       T=10      c = -1.21                  0.73    2.14    5.28    1.12    7.00    38.9
             T=100     c = -1.82                  0.73    2.10    5.12    1.09    6.48    36.1
             T=1000    c = -1.84                  0.73    2.10    5.12    1.08    6.34    35.4
ε_n-GREEDY   T=10      c = 0.0499, d = 1.505      0.79    3.86    32.5    1.01    7.31    67.6
             T=100     c = 1.096, d = 1.349       0.95    3.19    14.8    1.12    6.38    46.6
             T=1000    c = 0.845, d = 0.738       1.23    3.48    9.93    1.32    6.28    37.7

Learned numerical policies
POWER-1      T=10      (16 parameters)            0.72    2.29    14.0    0.97    5.94    49.7
             T=100                                0.77    1.84    5.64    1.04    5.13    27.7
             T=1000                               0.88    2.09    4.04    1.17    5.95    28.2
POWER-2      T=10      (81 parameters)            0.72    2.37    15.7    0.97    6.16    55.5
             T=100                                0.76    1.82    5.81    1.05    5.03    29.6
             T=1000                               0.83    2.07    3.95    1.12    5.61    27.3

Learned symbolic policies
FORMULA-1    T=10      √t_k (r̄_k − 1/2)           0.72    2.37    14.7    0.96    5.14    30.4
             T=100     r̄_k + 1/(t_k + 1/2)        0.76    1.85    8.46    1.12    5.07    29.8
             T=1000    r̄_k + V(t_k + 2)           0.80    2.31    4.16    1.23    6.49    26.4
FORMULA-2    T=10      |r̄_k − 1/(t_k + t)|        0.72    2.88    22.8    1.02    7.15    66.2
             T=100     r̄_k + min(1/t_k, log(2))   0.78    1.92    6.83    1.17    5.22    29.1
             T=1000    1/t_k − 1/(r̄_k − 2)        1.10    2.62    4.29    1.38    6.29    26.1

Table 1. Mean expected regret of untuned, tuned and learned policies on Bernoulli
and Gaussian bandit problems. Best scores in each of these categories are shown in
bold. Scores corresponding to policies that are tested on the same horizon T as the
horizon used for training/tuning are shown in italics.

Policy            T = 10    T = 100   T = 1000    Policy            T = 10    T = 100   T = 1000
Generic policies                                  Learned policies
UCB1              48.1 %    78.1 %    83.1 %      Power-1           54.6 %    82.3 %    91.3 %
UCB2              12.7 %    6.8 %     6.8 %       Power-2           54.2 %    84.6 %    90.3 %
UCB-V             38.3 %    57.2 %    49.6 %      Formula-1         61.7 %    76.8 %    88.1 %
KL-UCB            50.5 %    65.0 %    67.0 %      Formula-2         61.0 %    80.0 %    73.1 %
ε_n-GREEDY        37.5 %    14.1 %    10.7 %

Table 2. Percentage of wins against UCB1-Tuned of generic and learned policies.
Best scores are shown in bold.

interpretability/performance tradeoff is common in machine learning and was
identified several decades ago in the field of supervised learning. It is worth
mentioning that, among the 99020 formula equivalence classes, a surprisingly
large number of strategies outperforming generic policies were found: when T =
100 (resp. T = 1000), we obtain about 50 (resp. 80) different symbolic policies
outperforming the generic policies.

Robustness w.r.t. the horizon T. As expected, the learned policies give their
best performance when the training and the testing horizons are equal. Policies
learned with a large training horizon prove to work well also on smaller
horizons. However, when the testing horizon is larger than the training horizon,
the quality of the policy may quickly degrade (e.g. when evaluating Power-1
trained with T = 10 on a horizon T = 1000).

Robustness w.r.t. the kind of distribution. Although truncated Gaussian
distributions are significantly different from Bernoulli distributions, the
learned policies generalize well to this new setting most of the time and still
outperform all the generic policies.

A word on the learned symbolic policies. It is worth noticing that the best
index-based policies (Formula-1) found for the two largest horizons (T = 100
and T = 1000) work in a similar way to the UCB-type policies reported earlier in
the literature. Indeed, they also associate with an arm k an index which is the
sum of r̄_k and of a positive (optimistic) term that decreases with t_k. However,
for the shortest time horizon (T = 10), the policy found (√t_k (r̄_k − 1/2)) is
totally different from UCB-type policies. With such a policy, only the arms
whose empirical reward mean is higher than a given threshold (0.5) have positive
index scores and are candidates for selection, i.e. making the scores negative
has the effect of killing bad arms. If the r̄_k of an arm is above the threshold,
then the index associated with this arm will increase with the number of times
it is played, and not decrease as is the case for UCB policies. If all empirical
means r̄_k are below the threshold, then for equal reward means, arms that have
been played less are preferred. This finding is amazing since it suggests that
the optimistic paradigm for multi-armed bandits upon which UCB policies are
based may in fact not be adapted at all to a context where the horizon is small.
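
The behavior just described can be checked numerically. The short sketch below
transcribes the two Formula-1 entries of Table 1 for T = 10 and T = 100 as
Python functions (the function names are ours, for illustration only) and
prints their index values for an arm above and an arm below the 0.5 threshold
as the number of plays grows.

```python
import math


def formula1_t10(mean, plays):
    """sqrt(t_k) * (mean - 1/2): Formula-1 learned for T = 10 (Table 1)."""
    return math.sqrt(plays) * (mean - 0.5)


def formula1_t100(mean, plays):
    """mean + 1 / (t_k + 1/2): Formula-1 learned for T = 100 (Table 1)."""
    return mean + 1.0 / (plays + 0.5)


for plays in (1, 4, 9):
    print(
        plays,
        formula1_t10(0.7, plays),   # above threshold: grows with plays
        formula1_t10(0.3, plays),   # below threshold: increasingly negative
        formula1_t100(0.7, plays),  # UCB-like: bonus shrinks with plays
    )
```

For the T = 10 formula, an arm with empirical mean 0.7 sees its index grow with
the number of plays, whereas for the T = 100 formula the index of the same arm
shrinks towards its empirical mean, as in UCB-type policies.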

Percentage of wins against UCB1-Tuned. Table 2 gives, for each policy, its
percentage of wins against UCB1-Tuned, when trained with the same horizon
as the test horizon. To compute this percentage of wins, we evaluate the expected
regret on each of the 10000 testing problems and count the number of problems
for which the tested policy outperforms UCB1-Tuned. We observe that by
minimizing the expected regret, our learned policies also reach high values of
percentage of wins: 84.6 % for T = 100 and 91.3 % for T = 1000. Note that, in
our approach, it is easy to change the objective function. So if the real
applicative aim was to maximize the percentage of wins against UCB1-Tuned, this
criterion could have been used directly in the policy optimization stage to
reach even better scores.
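
The percentage-of-wins criterion can be written down in a few lines. The
sketch below is illustrative: the function name win_percentage is hypothetical,
the inputs are assumed to be per-problem expected regrets computed on the same
testing problems, and ties are assumed not to count as wins, a detail the text
does not specify.

```python
def win_percentage(policy_regrets, baseline_regrets):
    """Share of problems (in %) where the policy has strictly lower regret."""
    if len(policy_regrets) != len(baseline_regrets) or not policy_regrets:
        raise ValueError("need two non-empty regret lists of equal length")
    wins = sum(p < b for p, b in zip(policy_regrets, baseline_regrets))
    return 100.0 * wins / len(policy_regrets)
```

Used as an objective, the same function could replace the mean regret in the
policy optimization stage, with the baseline regrets computed once for
UCB1-Tuned.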

6.3 Computational time 

We used a C++ based implementation to perform our experiments. In the
numerical case with 10 cores at 1.9 GHz, performing the whole learning of
Power-1 took one hour for T = 100 and ten hours for T = 1000. In the symbolic
case using a single core at 1.9 GHz, performing the whole learning took 22
minutes for T = 100 and a bit less than three hours for T = 1000. The fact that
symbolic learning is much faster can be explained by two reasons. First, we
tuned the EDA algorithm in a very careful way to be sure to find a high quality
solution; what we observe is that by using only 10% of this learning time, we
already obtain close-to-optimal strategies. The second factor is that our
symbolic learning algorithm saves a lot of CPU time by being able to rapidly
reject bad strategies thanks to the multi-armed bandit formulation upon which it
relies.

7 Conclusions 

The approach proposed in this paper for exploiting prior knowledge for learning
exploration/exploitation policies has been tested for two-armed bandit problems
with Bernoulli reward distributions and a known time horizon.
The learned policies were found to significantly outperform other policies
previously published in the literature such as UCB1, UCB2, UCB-V, KL-UCB
and ε_n-GREEDY. The robustness of the learned policies with respect to wrong
information was also highlighted, by evaluating them on two-armed bandits with
truncated Gaussian reward distributions.

There are in our opinion several research directions that could be investigated
to further improve the algorithm for learning policies proposed in this paper.
For example, we found that problems similar to the overfitting problem met
in supervised learning could occur when considering too large a set of candidate
policies. This naturally calls for studying whether our learning approach could
be combined with regularization techniques. Along this idea, more sophisticated
optimizers could also be considered for identifying, in the set of candidate
policies, the one which is predicted to behave best.

The UCB1, UCB2, UCB-V, KL-UCB and ε_n-GREEDY policies used for
comparison were shown (under certain conditions) to have interesting bounds
on their expected regret in asymptotic conditions (very large T), while we did
not aim at providing such bounds for our learned policies. It would certainly be
relevant to investigate whether similar bounds could be derived for our learned
policies or, alternatively, to see how the approach could be adapted so as to
target policies offering such theoretical performance guarantees in asymptotic
conditions. For example, better bounds on the expected regret could perhaps
be obtained by identifying in a set of candidate policies the one that gives the
smallest maximal value of the expected regret over this set rather than the one
that gives the best average performance.
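
The difference between the two selection criteria mentioned above can be
illustrated with a small sketch. It assumes that the expected regret of each
candidate policy has already been estimated on every sampled problem; the
function name select_policies and the dictionary layout are hypothetical.

```python
def select_policies(regret_table):
    """regret_table maps a policy name to its regrets, one per sampled problem.

    Returns the policy with the best average regret and the policy with the
    smallest worst-case (maximal) regret; the two may differ.
    """
    best_average = min(
        regret_table,
        key=lambda p: sum(regret_table[p]) / len(regret_table[p]),
    )
    best_worst_case = min(regret_table, key=lambda p: max(regret_table[p]))
    return best_average, best_worst_case
```

A policy that is excellent on most problems but poor on a few can win under the
average criterion and lose under the worst-case one.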

Finally, while our paper has provided simulation results in the context of
the simplest multi-armed bandit setting, our exploration/exploitation policy
meta-learning scheme can also in principle be applied to any other
exploration-exploitation problem. In this line of research, the extension of
this investigation to (finite) Markov Decision Processes studied in [15] already
suggests that our approach to meta-learning E/E strategies can be successful in
much more complex settings.